Comprehensive and clinically accurate head and neck cancer organs-at-risk delineation on a multi-institutional study

Accurate organ-at-risk (OAR) segmentation is critical to reduce radiotherapy complications. Consensus guidelines recommend delineating over 40 OARs in the head-and-neck (H&N). However, prohibitive labor costs cause most institutions to delineate a substantially smaller subset of OARs, neglecting the dose distributions of other OARs. Here, we present an automated and highly effective stratified OAR segmentation (SOARS) system using deep learning that precisely delineates a comprehensive set of 42 H&N OARs. We train SOARS using 176 patients from an internal institution and independently evaluate it on 1327 external patients across six different institutions. It consistently outperforms other state-of-the-art methods by at least 3–5% in Dice score for each institutional evaluation (up to 36% relative distance error reduction). Crucially, multi-user studies demonstrate that 98% of SOARS predictions need only minor or no revisions to achieve clinical acceptance (reducing workloads by 90%). Moreover, segmentation and dosimetric accuracy are within or smaller than the inter-user variation.

ratio of 1:2 for NAS. Therefore, of all 176 patients, the NAS training procedure uses 53% (80% × 2/3) for training, 27% (80% × 1/3) for validation, and 20% for ablation-testing. After the NAS procedure finalizes the network architecture, we retrain the model from scratch using only the searched architecture, with the validation/training sizes set to a more typical ratio: 64% for training, 16% for validation, and 20% for ablation-testing. Importantly, the ablation-testing cases (20% of the Training-Validation dataset) were never seen during NAS training or validation.
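The split percentages above follow from holding out 20% of the patients and dividing the remainder by a training-to-validation ratio. The sketch below reproduces that arithmetic; the function name and the 4:1 ratio used to express the 64%/16% split are illustrative, not taken from the authors' code.

```python
def split_fractions(holdout=0.20, train_to_val=(2, 1)):
    """Return (train, validation, ablation-test) fractions of the full dataset."""
    remaining = 1.0 - holdout          # share left after the ablation-testing hold-out
    t, v = train_to_val
    train = remaining * t / (t + v)
    val = remaining * v / (t + v)
    return train, val, holdout


# NAS phase: training:validation = 2:1 on the remaining 80%
nas_split = split_fractions(0.20, (2, 1))
# Retraining phase: 64% / 16% corresponds to training:validation = 4:1
final_split = split_fractions(0.20, (4, 1))

print([round(100 * f) for f in nas_split])
print([round(100 * f) for f in final_split])
```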
NAS training. We use NAS to search for the optimal network architecture of each stratified OAR segmentation branch. A combined Dice and Cross-Entropy loss is adopted, and the stochastic gradient descent optimizer is used with a Nesterov momentum of 0.99. To train the NAS parameter αk, we first fix αk to 1/9 for 400 epochs. We then alternately update αk and the network weights for another 600 epochs. The batch size is set to 2 for NAS training. Only the validation set is used for updating αk. The ratio between the training set and the validation set is 2:1. The initial learning rates are set to 0.01 for the anchor and mid-level branches and 0.005 for the S&H branch. The learning rate is decayed following the Polynomial learning rate policy.
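A minimal sketch of the schedule described above, under stated assumptions: the decay exponent (0.9) is a commonly used value that the text does not specify, and decaying over the full 1000 NAS epochs is likewise assumed. The function and constant names are hypothetical.

```python
WARMUP_EPOCHS = 400   # alpha_k held fixed at 1/9
SEARCH_EPOCHS = 600   # alpha_k and network weights updated alternately
TOTAL_EPOCHS = WARMUP_EPOCHS + SEARCH_EPOCHS

# Initial learning rates per branch, as stated in the text
INITIAL_LR = {"anchor": 0.01, "mid-level": 0.01, "S&H": 0.005}


def poly_lr(initial_lr, epoch, total_epochs=TOTAL_EPOCHS, power=0.9):
    """Polynomial decay; power=0.9 is an assumption, not given in the text."""
    return initial_lr * (1.0 - epoch / total_epochs) ** power


def nas_phase(epoch):
    """Report which NAS phase a (0-based) epoch falls into."""
    return "alpha_fixed" if epoch < WARMUP_EPOCHS else "alternating"


for epoch in (0, 399, 400, 999):
    print(epoch, nas_phase(epoch), poly_lr(INITIAL_LR["S&H"], epoch))
```

How the alternation is interleaved (per iteration or per epoch) is not stated in the text, so the sketch only marks the phase boundary.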
Final segmentation network training. After NAS is completed, we retrain the searched segmentation network from scratch. Data augmentation is applied¹, e.g., horizontal flipping, random rotations in the x-y plane within ±10 degrees, intensity scaling with a ratio in [0.75, 1.25], and additive Gaussian noise with zero mean and variance in (0, 0.1). The batch size is 2. The optimizer is stochastic gradient descent with a Polynomial learning rate policy. The initial learning rate is 0.01 with a Nesterov momentum of 0.99. The S&H detection branch is trained using L2 loss with a learning rate of 0.01. The total number of training epochs for each module is 1000. The average training time is 9–10 GPU days. For inference, the average running time is normally less than 3 minutes per patient. All deep models are developed using PyTorch and trained on one NVIDIA Quadro RTX 8000 GPU.
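The augmentation ranges listed above can be sketched as a parameter sampler. This is an illustration only: uniform sampling within each range and a 50% flip probability are assumptions, and the helper names are hypothetical rather than the authors' implementation.

```python
import random


def sample_augmentation(rng):
    """Draw one set of augmentation parameters from the stated ranges."""
    return {
        "flip_horizontal": rng.random() < 0.5,        # assumed 50% probability
        "rotation_deg": rng.uniform(-10.0, 10.0),     # in-plane (x-y) rotation
        "intensity_scale": rng.uniform(0.75, 1.25),
        "noise_variance": rng.uniform(0.0, 0.1),      # zero-mean Gaussian noise
    }


def augment_intensities(values, params, rng):
    """Apply intensity scaling and additive Gaussian noise to a list of voxel values."""
    sigma = params["noise_variance"] ** 0.5           # standard deviation from variance
    return [v * params["intensity_scale"] + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(0)
params = sample_augmentation(rng)
augmented = augment_intensities([0.0, 0.5, 1.0], params, rng)
```

The geometric transforms (flip and rotation) would be applied identically to the CT volume and its label maps; that step is omitted here.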

Quantitative ablation results of SOARS in the training-validation dataset
Effect of processing stratification in SOARS. Processing stratification played a key role in improving the OAR segmentation performance. The processing stratification ablation results are shown in Table 2. The baseline is a 3D UNet model (implemented in the nnUNet framework)¹ trained on all 42 OARs together. When anchor OARs were stratified to train only on themselves, there was a 2.4% Dice similarity coefficient (DSC) improvement as compared to the baseline model. When focusing on mid-level OARs, with the help of anchor OAR guidance, there was a significant 37% Hausdorff distance (HD) error reduction (11.4 versus 18.0 mm) as compared to the baseline model trained on all OARs. This demonstrated the intrinsic difficulty of segmenting a large number of varied organs without explicitly taking their differences into account. It also indicated that anchor OARs served as effective references to better delineate the hard-to-discern boundaries of mid-level organs (most are soft-tissue organs). For S&H OARs, cropping the volume of interest (VOI) using the detection module, with the support of anchor OAR predictions, yielded remarkable accuracy improvements, boosting DSC from 58.3% to 73.7% as compared against directly segmenting from CT. This further demonstrated the merits of our stratified learning approach, which adapts its handling to OAR categories with different characteristics. Fig. 3 depicts qualitative examples of segmenting anchor, mid-level and S&H OARs.
Table 2 also outlines the performance improvements provided by NAS. As can be seen, all three branches trained with NAS consistently produced more accurate segmentation results than those trained using the baseline 3D UNet. This validated the effectiveness of NAS on more complicated segmentation tasks. Among the three branches, the mid-level and S&H OAR categories showed considerable performance improvements, from 72.6% to 74.2% and from 73.7% to 76.2% in DSC, respectively, while the anchor branch showed a marginal but consistent improvement (0.7% in DSC). Considering that anchor OARs are already relatively easy to segment, the fact that NAS could further boost the performance attested to its benefits.
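For reference, the two summary quantities used in this ablation can be written out as follows. The sketch computes DSC on sets of voxel coordinates and the relative error reduction from the reported HD values; treating two empty masks as a perfect match is a convention assumed here, not something the text states.

```python
def dice(pred_voxels, ref_voxels):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    pred, ref = set(pred_voxels), set(ref_voxels)
    if not pred and not ref:
        return 1.0                      # assumed convention for two empty masks
    return 2 * len(pred & ref) / (len(pred) + len(ref))


def relative_reduction(baseline, improved):
    """Relative reduction of an error metric with respect to the baseline."""
    return (baseline - improved) / baseline


# Mid-level OARs: HD of 18.0 mm (baseline) versus 11.4 mm (stratified)
hd_reduction = relative_reduction(18.0, 11.4)
print(f"{100 * hd_reduction:.1f}%")

# Toy masks sharing one of two voxels each
toy_dsc = dice({(0, 0, 0), (0, 0, 1)}, {(0, 0, 1), (0, 0, 2)})
```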

Effect of neural architecture search (NAS) associated with SOARS.
The NAS-searched neural network architectures are depicted in Supplementary Fig. 1. For the encoding path, the mid-level and S&H branches gradually involve more 3D or P3D convolution kernels than the anchor branch. Because 2D kernels dominate the anchor branch, this indicates that 3D kernels may not always be the best choice for segmenting objects with reasonable size or contrast. Consequently, appropriate 2D and P3D kernels can reduce the computation cost and memory consumption. For the S&H branch, our findings are consistent with the intuition that small or low-contrast objects rely more on 3D spatial information and context for better segmentation. As for the decoding path, all three branches are mainly equipped with 3D or P3D convolution kernels. This is an interesting result, as it implies that the decoding path tries to incorporate the convolutional features in a more 3D fashion for all three OAR categories.
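The cost argument above can be made concrete by counting convolution weights. The sketch assumes 3×3 in-plane kernels and models P3D as a 1×3×3 convolution followed by a 3×1×1 convolution, which is a common pseudo-3D formulation; the exact kernel shapes and the channel width of 32 are assumptions, not values from the text. Biases are ignored.

```python
def conv_weights(kernel_shapes, c_in, c_out):
    """Total weights of a stack of convolutions applied one after another."""
    total = 0
    channels = c_in
    for kd, kh, kw in kernel_shapes:
        total += kd * kh * kw * channels * c_out
        channels = c_out                # later layers in the stack see c_out channels
    return total


# Assumed kernel shapes as (depth, height, width)
KERNELS = {
    "2D": [(1, 3, 3)],
    "P3D": [(1, 3, 3), (3, 1, 1)],
    "3D": [(3, 3, 3)],
}

weight_counts = {name: conv_weights(shapes, 32, 32) for name, shapes in KERNELS.items()}
print(weight_counts)
```

Under these assumptions a 2D block uses one third of the weights of a full 3D block, and a P3D block a little under half.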

Blinded user study to assess the OAR editing efforts
We further designed a blinded user study to assess observer variation in evaluating the OAR editing effort. This study used the 30 multi-user testing patients from FAH-ZU and involved the senior physician (J. Ge) who originally drew the gold-standard contours of 13 OAR types in these patients. For each OAR, we randomly selected its contour from one of three sources {gold-standard, SOARS, or the other human reader} (see Supplementary Fig. 6) and presented it to this physician, who was kept unaware of the true contour source. We asked the physician to judge whether each OAR contour needed editing, and we report the number of contours requiring editing for each source in Supplementary Table 9. In the blinded study, 15% of the gold-standard contours were deemed to require further editing, which reflects the intra-observer variation in assessing the OAR revision effort. Of the SOARS contours, 43% required revision, slightly higher than in the original unblinded assessment by this physician, in which 37% of SOARS contours required revision among the 13 OAR types of FAH-ZU. Because the proportions of SOARS contours requiring revision in the blinded and unblinded assessments are close, the observer variation/bias appears to be within a small range. Moreover, a noticeably higher proportion of the human reader's contours required revision (55% versus 43% for SOARS), reflecting that the quality of the SOARS contours was generally better than that of the human reader's in the blinded assessment. This observation is consistent with the quantitative contouring accuracy of SOARS and the human reader (Table 5 in the main manuscript text). This additional analysis further strengthens our results regarding the OAR editing effort.
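The randomization and tally described above can be sketched as follows. This is a hypothetical illustration: uniform random selection among the three sources, the placeholder patient and OAR identifiers, and the function names are all assumptions, and no study data are reproduced here.

```python
import random
from collections import defaultdict

SOURCES = ("gold-standard", "SOARS", "human-reader")


def assign_sources(patient_ids, oar_names, seed=0):
    """Pick one contour source at random for every (patient, OAR) pair."""
    rng = random.Random(seed)
    return {(p, o): rng.choice(SOURCES) for p in patient_ids for o in oar_names}


def revision_rate(assignments, needs_editing):
    """Fraction of contours judged to need editing, grouped by hidden source."""
    shown = defaultdict(int)
    flagged = defaultdict(int)
    for key, source in assignments.items():
        shown[source] += 1
        flagged[source] += bool(needs_editing[key])
    return {source: flagged[source] / shown[source] for source in shown}


# Placeholder identifiers: 30 patients and 13 OAR types
patients = list(range(30))
oars = [f"OAR_{i}" for i in range(1, 14)]
assignments = assign_sources(patients, oars, seed=0)
```

The physician's yes/no judgments would be supplied as `needs_editing`, keyed by the same (patient, OAR) pairs, only after the blinded review is complete.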

Supplementary Figures
Supplementary Fig. 1. The detailed auto-searched backbone network architecture based on UNet.
Supplementary Fig. 6. Examples of randomly selected OARs in the blinded user study for the observer variation/bias assessment in evaluating the OAR editing effort. Each OAR in a patient is randomly chosen from one of the three contouring sources {Gold, SOARS, Human-reader}. These OAR contours are presented blindly to the physician to determine whether revision of any of the OARs is needed.

Supplementary Tables
Supplementary Table 1. Detailed planning CT imaging protocols in each institution. CE represents contrast-enhanced; NC represents non-contrast.
Note: Bold and highlighted values represent the best performance and statistically significant improvements, respectively, calculated using the Wilcoxon matched-pairs signed rank test comparing UaNet and SOARS. Statistical significance is set at two-tailed p < 0.05.